From datasets to resultssets in Stata

Author

Roger Newson

Abstract

In general, a Stata dataset, or a dataset in any other format, should contain one observation per thing, and data on attributes of things. For instance, in the medical sector, a dataset might have one observation per patient, and data on the patient’s baseline characteristics. Alternatively, a dataset might have one observation per visit to a health centre, and data on the state of the patient who made the visit. Usually, the variables in a dataset are either primary key variables, which identify the things corresponding to the observations, or non-key variables, which identify interesting attributes of those things. For instance, if there is one observation per patient, then one of the variables is usually a patient ID number, which identifies the observation uniquely. Or, if there is one observation per visit, and a patient may only make one visit per day, then the primary key variables are usually patient ID and visit date.

Statisticians (and other data analysts) are typically provided with (or collect for themselves) a dataset with one observation per “experimental or observational unit”, where a unit may be a patient, a patient-day, a country-year, a car model, or some other thing. However, they are typically paid to produce plots for presentation, or tables for publication. To do this, they really need datasets with one observation per plotted data point, or per Y-axis label, or per X-axis label, or per table row. Axis labels and table rows do not often correspond to the original units in the original dataset. For instance, in the auto data, shipped with official Stata, there is one observation per car model, and the primary key is the single variable make. Figure 1 gives confidence intervals from a regression model measuring differences in fuel efficiency (in miles per gallon) in cars from inside and outside the US, with various 1978 repair records, compared with a “reference car”, made by a US company and with a repair record of 3.
It was created using the eclplot package, which creates confidence interval plots, and requires, as input, a dataset with one observation per confidence interval to be plotted and data on estimates and confidence limits. Table 1 gives the same data (plus P-values) as a table. It was created using the listtex package, which creates tables that can be cut and pasted into a TeX, LaTeX, HTML or word processor document, and requires, as input, a dataset with one observation per table row. Both eclplot and listtex are downloadable from SSC, but neither can use directly as input the original auto data, with one observation per car model. Instead, resultsset-generating and resultsset-processing packages are used, taking as input the original auto data, and creating, as output, datasets that can be input to eclplot and listtex.

This survey explains how to do this, using example do-files distributed with this document on the conference website at http://www.stata.com/support/meeting/10uk/ . These do-files will work under Stata 8 if the user has installed the required unofficial Stata packages, listed at the top of each do-file. These packages are downloadable from the SSC archive at http://ideas.repec.org/s/boc/bocode.html using the Stata ssc command. The full list of required packages comprises the resultsset-generating packages descsave, parmest, xcollapse and xcontract, together with the resultsset-processing packages eclplot, listtex, dsconcat, sencode, sdecode, factext, factmerg and ingap, and the estimation package somersd.
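The core idea above, turning a dataset with one observation per unit into a resultsset with one observation per table row, can be sketched outside Stata. The following is a minimal Python illustration only: it does not use or reproduce the packages named in the abstract, the function name `make_resultsset` is hypothetical, and the car records are invented toy values rather than the real auto data.

```python
from statistics import mean

# Toy input dataset: one observation per car model (primary key: make).
# All values are invented for illustration; they are NOT the auto data.
cars = [
    {"make": "Model A", "foreign": 0, "mpg": 20},
    {"make": "Model B", "foreign": 0, "mpg": 24},
    {"make": "Model C", "foreign": 1, "mpg": 28},
    {"make": "Model D", "foreign": 1, "mpg": 30},
    {"make": "Model E", "foreign": 1, "mpg": 26},
]


def make_resultsset(rows, by, value):
    """Return one row per distinct value of `by` (hypothetical helper).

    Each output row holds the group key, the number of input
    observations in the group, and the group mean of `value`.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[by], []).append(row[value])
    return [
        {by: key, "n": len(vals), "mean_" + value: mean(vals)}
        for key, vals in sorted(groups.items())
    ]


# Output dataset: one observation per table row (primary key: foreign).
resultsset = make_resultsset(cars, by="foreign", value="mpg")
for row in resultsset:
    print(row)
```

The primary key changes from `make` in the input to `foreign` in the output, which is the shift in "one observation per thing" that a table or plot needs.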


Similar resources

Spatio-temporal variability of aerosol characteristics in Iran using remotely sensed datasets

The present study is the first attempt to examine temporal and spatial characteristics of aerosol properties and classify their modes over Iran. The data used in this study include the records of Aerosol Optical Depth (AOD) and Angstrom Exponent (AE) from MODerate Resolution Imaging Spectroradiometer (MODIS) and Aerosol Index (AI) from the Ozone Monitoring Instrument (OMI), obtained from 2005 t...


Automatic segmentation of glioma tumors from BraTS 2018 challenge dataset using a 2D U-Net network

Background: Glioma is the most common primary brain tumor, and early detection of tumors is important in treatment planning for the patient. Precise segmentation of the tumor and intratumoral areas on MRI by a radiologist is the first step in the diagnosis; besides being time-consuming, it can also yield different diagnoses from different physicians. The aim of this study...


Representing Spectral data using LabPQR color space in comparison to PCA method

In many applications of color technology, such as spectral color reproduction, it is of interest to represent spectral data with fewer dimensions than the spectral space has. For more than half a century, the Principal Component Analysis (PCA) method has been applied to find the number of independent basis vectors of a spectral dataset and to represent spectral reflectance with lower di...


A hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts

High-dimensional microarray datasets are difficult to classify since they have many features with a small number of instances and an imbalanced distribution of classes. This paper proposes a filter-based feature selection method to improve the classification performance of microarray datasets by selecting the significant features. Combining the concepts of rough sets, weighted rough set, fuzzy rough se...


Clustering of Fuzzy Data Sets Based on Particle Swarm Optimization With Fuzzy Cluster Centers

In the current study, a particle swarm clustering method is suggested for clustering triangular fuzzy data. The proposed method finds fuzzy cluster centers; the more points from the corresponding cluster a fuzzy cluster center contains, the higher the clustering accuracy. Also, triangular fuzzy numbers are utilized to represent uncertain data. To compare triangular fuzzy ...




Publication date: 2004
